*adding a new sentence here and resubmitting this to see if it works on the blog. *resubmit attempt 2.
Brooklyn Nine-Nine, which is about a police precinct in New York City, is one of my favourite TV shows. It made me want to learn more about actual crime statistics in the city.
Specifically, this project examines how crime intersects with other social issues. The disparities in arrest rates across demographic groups have been well-documented (Jahn et al., 2022; Schleiden et al., 2020). Past research has also revealed a positive association between crime and drug usage (Pierce et al., 2017).
Research Questions
What is the demographic profile of those arrested for drug-related crimes? Does it differ from that of general offenders?
Suppose you were walking through a borough in New York City, and you see syringe litter. Does this necessarily mean that the borough in question is unsafe – or is this merely a bias that we have? In simpler terms: where are syringes likely to be found? Is there an overlap with where crimes (especially drug-related crimes) occur?
Crime is operationalized as arrestdata (New York Police Department, 2022a; New York Police Department, 2022b), and drug usage as syringe litter data (NYC Parks, 2022a). The data spans from 2017 to 2022. Codebooks are provided here (New York Police Department, 2022a) and here (NYC Parks, 2022b).
Generated by summarytools 1.0.1 (R version 4.2.1) 2022-09-01
#read in shapefile for syringe dataset.park <-st_read("_data/ParksProperties", quiet =TRUE) %>%select("gispropnum", "geometry")
In the code chunk above, all datasets have been loaded, unnecessary columns removed, columns renamed where needed and all missing values labelled NA. Some missing values were labelled “9” in arrest_ytd and have also been converted to NA.
arrest_historic contains 5,308,876 rows, while arrest_ytd contains 93,238 rows. Each row represents a single arrest and associated information (e.g., description of crime, arrest location, demographic profile of suspect). Both have the same 9 columns and only differ in time frame - arrest_historic contains data from Jan 2006 to Dec 2021, while arrest_ytd contains data from Jan to Jun 2022.
I will interpret their descriptive statistics again after combining them, but these are some initial observations:
Dangerous drugs are in the list of top 5 offenses in both datasets, highlighting the importance of our research question.
Arrests seem to occur the least often in Staten Island - it would be interesting to see if this is where the least number of syringes are collected as well.
Black individuals represent ~50% of all arrests in both datasets, despite only being ~20% of the population in NYC (U.S. Census Bureau, 2022).
Meanwhile, syringe contains 17,531 rows and 9 columns. Rows where s_totalsyringes were 0 were removed in the read-in step, as they do not contain meaningful information for analysis. The dataset ranges from Jan 2017 to Jul 2022. It contains information on syringe collection, including date, location and number of syringes. Each row does not yet represent a single date/location combination where syringes were collected, so it needs to be tidied before descriptives can be interpreted. The one thing that stands out to me is that no syringes were collected in Brooklyn - hopefully the analysis later can unpack why this might be the case.
I have also loaded the park shapefile (Department of Parks and Recreation, 2022), which will be used derive location coordinates for syringe.
Tidy Data
Steps to tidy arrest_historic and arrest_ytd:
Convert date to “date” column type.
Create a separate column, year.
Remove rows before 2017 in arrest_historic.
Join both datasets into arrest dataset (93,238 + 1,043,535 = 1,136,773 rows; 9 + 1 = 10 columns).
# change 'date' to date column type.arrest_historic <- arrest_historic %>%mutate(date =as_date(parse_date_time(date, c('mdy'))))arrest_ytd <- arrest_ytd %>%mutate(date =as_date(parse_date_time(date, c('mdy'))))
# create 'year'.arrest_historic$"year"<-year(arrest_historic$date)arrest_ytd$"year"<-year(arrest_ytd$date)# remove rows before 2017 in arrest_historic.arrest_historic <- arrest_historic %>%filter(year >2016)# join datasets.arrest <-rbind(arrest_historic, arrest_ytd)# spell out borough names in full.arrest$a_boro <-recode(arrest$a_boro, B ='Bronx', S ='Staten Island', K ='Brooklyn', M ='Manhattan', Q ='Queens')# sanity check.print(dfSummary(arrest, varnumbers =FALSE, plain.ascii =FALSE, graph.magnif =0.30, style ="grid", valid.col =FALSE), method ='render', table.classes ='table-condensed')
Data Frame Summary
arrest
Dimensions: 1136773 x 10
Duplicates: 97724
Variable
Stats / Values
Freqs (% of Valid)
Graph
Missing
date
[Date]
min : 2017-01-01
med : 2019-02-25
max : 2022-06-30
range : 5y 5m 29d
2007 distinct values
0
(0.0%)
a_desc
[character]
1. ASSAULT 3 & RELATED OFFEN
2. DANGEROUS DRUGS
3. PETIT LARCENY
4. FELONY ASSAULT
5. VEHICLE AND TRAFFIC LAWS
6. MISCELLANEOUS PENAL LAW
7. ROBBERY
8. GRAND LARCENY
9. DANGEROUS WEAPONS
10. CRIMINAL MISCHIEF & RELAT
[ 75 others ]
166190
(
14.6%
)
129285
(
11.4%
)
102593
(
9.0%
)
83592
(
7.4%
)
67024
(
5.9%
)
62845
(
5.5%
)
48499
(
4.3%
)
47368
(
4.2%
)
46133
(
4.1%
)
32889
(
2.9%
)
348324
(
30.7%
)
2031
(0.2%)
a_offenselevel
[character]
1. F
2. I
3. M
4. V
436854
(
38.7%
)
2838
(
0.3%
)
675669
(
59.9%
)
12874
(
1.1%
)
8538
(0.8%)
a_boro
[character]
1. Bronx
2. Brooklyn
3. Manhattan
4. Queens
5. Staten Island
255022
(
22.4%
)
312014
(
27.4%
)
288631
(
25.4%
)
232670
(
20.5%
)
48436
(
4.3%
)
0
(0.0%)
a_age
[character]
1. <18
2. 18-24
3. 25-44
4. 45-64
5. 65+
52411
(
4.6%
)
239476
(
21.1%
)
606859
(
53.4%
)
223661
(
19.7%
)
14366
(
1.3%
)
0
(0.0%)
a_sex
[character]
1. F
2. M
200375
(
17.6%
)
936398
(
82.4%
)
0
(0.0%)
a_race
[character]
1. AMERICAN INDIAN/ALASKAN N
2. ASIAN / PACIFIC ISLANDER
3. BLACK
4. BLACK HISPANIC
5. UNKNOWN
6. WHITE
7. WHITE HISPANIC
3094
(
0.3%
)
61056
(
5.4%
)
548316
(
48.2%
)
98725
(
8.7%
)
7271
(
0.6%
)
133286
(
11.7%
)
285025
(
25.1%
)
0
(0.0%)
a_lat
[numeric]
Mean (sd) : 40.7 (0.1)
min ≤ med ≤ max:
40.5 ≤ 40.7 ≤ 40.9
IQR (CV) : 0.1 (0)
93984 distinct values
0
(0.0%)
a_long
[numeric]
Mean (sd) : -73.9 (0.1)
min ≤ med ≤ max:
-74.3 ≤ -73.9 ≤ -73.7
IQR (CV) : 0.1 (0)
93555 distinct values
0
(0.0%)
year
[numeric]
Mean (sd) : 2018.9 (1.6)
min ≤ med ≤ max:
2017 ≤ 2019 ≤ 2022
IQR (CV) : 3 (0)
2017
:
286225
(
25.2%
)
2018
:
246773
(
21.7%
)
2019
:
214617
(
18.9%
)
2020
:
140413
(
12.4%
)
2021
:
155507
(
13.7%
)
2022
:
93238
(
8.2%
)
0
(0.0%)
Generated by summarytools 1.0.1 (R version 4.2.1) 2022-09-01
The sanity check demonstrates that steps 1 to 4 are complete.
Steps to tidy syringe:
Convert date to “date” type.
Sum syringe count values in syringe into 1 row per date/location combination.
Extract latitude and longitude columns from park shapefile and convert them into a dataframe.
Combine that dataframe with syringe.
# remove timestamp from 'date'.syringe$date <-str_remove(syringe$date, "12:00:00 AM")# convert 'date' to date column type.syringe <- syringe %>%mutate(date =as_date(parse_date_time(date, c('mdy'))))# sum values into 1 row per date/location combo.syringe <- syringe %>%group_by(s_location, date, gispropnum, s_kiosktype, s_boro, s_kioskavailable) %>%summarise(across(contains("syringes"), ~sum(.x, na.rm =TRUE)))# extract latitude and longitude from park shapefile into park_ll.sf_use_s2(FALSE)park_ll <-st_coordinates(st_centroid(park$geometry))# convert park and park_ll to dataframes.park_ll <-as_tibble(park_ll)park <-as_tibble(park)# remove multipolygon column from park.park <-select(park, -2)# combine park and park_ll.park <-cbind(park, park_ll)# rename long and lat columns in park.park <-rename(park, "s_long"="X", "s_lat"="Y")# combine park and syringe.syringe <- syringe %>%left_join(park, by ="gispropnum")# sanity check.head(syringe)
The sanity check shows that steps 1 to 3 are complete. For now, the basic tidying is complete. Further transformations may be required for each visualization as we go along.
Analysis: Question 1
To recap, this is my first research question:
Question 1
What is the demographic profile of those arrested for drug-related crimes? Does it differ from that of general offenders?
Age, sex and race data are available in the arrest dataset. We can combine them together into 1 column to create demographic profiles, then make bar graphs.
# combine demographics into 1 column.arrest <- arrest %>%unite(demo, a_race, a_sex, a_age, sep ="_", remove =FALSE, na.rm =TRUE)# create bar graph for all arrests.arrest %>%filter(!is.na(demo)) %>%count(demo) %>%slice_max(n, n =5) %>%ggplot(aes(x =reorder(demo,-n/1136773*100), y = n/1136773*100, fill = demo)) +geom_col(stat="identity") +theme_minimal() +labs(title ="Top Demographic Profiles of All Offenders in NYC (Jan 2017 to Jun 2022)", x ="Profile", y ="Percent") +geom_text(aes(label=sprintf("%0.2f", ..y..)), position=position_dodge(width=0.9), vjust=-0.25, size=3) +theme(axis.text.x=element_text(angle=90,hjust=1))
# create bar graph for drug-related arrests.arrest %>%filter(a_desc =="DANGEROUS DRUGS") %>%filter(!is.na(demo)) %>%count(demo) %>%slice_max(n, n =5) %>%ggplot(aes(x =reorder(demo,-n/129285*100), y = n/129285*100, fill = demo)) +geom_col(stat="identity") +theme_minimal() +labs(title ="Top Demographic Profiles of Drug Offenders in NYC (Jan 2017 to Jun 2022)", x ="Profile", y ="Percent") +geom_text(aes(label=sprintf("%0.2f", ..y..)), position=position_dodge(width=0.9), vjust=-0.25, size=3) +theme(axis.text.x=element_text(angle=90,hjust=1))
In the above graphs, each demographic profile is labelled with a different colour, so that we can see the differences in proportion clearly. In the duration of 5 years, the top 5 demographic profiles are the same for all arrests and drug-related arrests (except for a switch in order at the 3rd and 4th positions). This somewhat answers our research question:
Question 1
When comparing all arrests and drug-related arrests, the top five demographic profiles of offenders seem to be largely similar. They are all male. Age group varies based on ethnicity: black men aged 18 to 64 are all quite likely to be arrested, although this was greater for black men aged 25-44. For white Hispanic and white men, arrests are concentrated in the 25-44 age group.
However, both bar graphs seem to show a stark difference between arrests of black males aged 25-44 and the other four groups. I am interested to see if this difference is significant.
# create new dataset for chi-square test.chi_square <- arrest %>%filter(!is.na(demo)) %>%filter(demo =="BLACK_M_18-24"| demo =="BLACK_M_25-44"| demo =="BLACK_M_45-64"| demo =="WHITE HISPANIC_M_25-44"| demo =="WHITE_M_25-44")# run chi-square test.chisq.test(x =table(chi_square$demo))
Chi-squared test for given probabilities
data: table(chi_square$demo)
X-squared = 140432, df = 4, p-value < 2.2e-16
# attempt post-hoc test - this is not working.# table_demo <- chi_square %>% count(demo)# chisq.posthoc.test(table_demo, method = "bonferroni")
The chi-square test reveals significant differences between the top 5 demographic profiles for drug-related and non-drug-related arrests, 𝜒2 (4) = 140432, p < .001. However, this doesn’t tell us which groups had the differences. I downloaded a post-hoc package, but I can’t get it to work.
Additional Analysis: Question 1
Doing the above analysis also made me interested in exploring a few other questions:
How did the proportion of drug-related arrests change over time?
Within drug-related arrests, how did the proportions of each demographic variable (age, sex, race) change over time?
# keep only top 3 levels of a_desc.arrest$a_desc <-fct_lump_n(arrest$a_desc, 3, other_level="Other")# change year to factor type so i can plot it.arrest$year <-as.factor(arrest$year)# stacked bar graph for offense type.arrest %>%filter(!is.na(a_desc)) %>%ggplot() +geom_bar(mapping =aes(x = year, y=..count.., fill=a_desc), position="fill") +theme_minimal() +labs(title ="Proportion of Different Offences in NYC (Jan 2017 to Jun 2022)", x ="Year", y ="Percent") +scale_y_continuous(labels = scales::percent) +theme(axis.text.x=element_text(angle=90,hjust=1))
# stacked bar graph for age.arrest %>%filter(a_desc =="DANGEROUS DRUGS") %>%filter(!is.na(a_age)) %>%ggplot() +geom_bar(mapping =aes(x = year, fill=a_age), position="fill") +theme_minimal() +labs(title ="Change in Age Composition of Drug Offenders in NYC (Jan 2017 to Jun 2022)", x ="Year", y ="Percent") +scale_y_continuous(labels = scales::percent) +theme(axis.text.x=element_text(angle=90,hjust=1))
# stacked bar graph for race.arrest %>%filter(a_desc =="DANGEROUS DRUGS") %>%filter(!is.na(a_race)) %>%ggplot() +geom_bar(mapping =aes(x = year, fill=a_race), position="fill") +theme_minimal() +labs(title ="Change in Race Composition of Drug Offenders in NYC (Jan 2017 to Jun 2022)", x ="Year", y ="Percent") +scale_y_continuous(labels = scales::percent) +theme(axis.text.x=element_text(angle=90,hjust=1))
# stacked bar graph for sex.arrest %>%filter(a_desc =="DANGEROUS DRUGS") %>%filter(!is.na(a_sex)) %>%ggplot() +geom_bar(mapping =aes(x = year, fill=a_sex), position="fill") +theme_minimal() +labs(title ="Change in Sex Composition of Drug Offenders in NYC (Jan 2017 to Jun 2022)", x ="Year", y ="Percent") +scale_y_continuous(labels = scales::percent) +theme(axis.text.x=element_text(angle=90,hjust=1))
This is what we can gather from the stacked bar graphs above. - The proportion of drug-related arrests has shrunk with time - this is a good thing! Although it is a part of the top 3 reasons for arrests, it is definitely on a downward trend. It will be useful to compare this with the change in syringes collected over time, to see if NYC’s stricter policies on drug usage and offences are effective. - I need to add percent labels to the demographic graphs to interpret them.
Analysis: Question 2
To recap, this is my second research question:
Question 2
Suppose you were walking through a borough in New York City, and you see syringe litter. Does this necessarily mean that the borough in question is unsafe – or is this merely a bias that we have? In simpler terms: where are syringes likely to be found? Is there an overlap with where crimes (especially drug-related crimes) occur?
This question can be easily visualized via dot maps of NYC.
# creating subset of syringe for visualisation.syringe_map <- syringe %>%filter(!is.na(s_long)) %>%filter(!is.na(s_lat)) %>%group_by(s_totalsyringes)# visualisation of syringe_map using mapview.syringe %>%filter(!is.na(s_long)) %>%filter(!is.na(s_lat)) %>%group_by(s_totalsyringes) %>%mapview(xcol ="s_long", ycol ="s_lat", crs =4326, grid =FALSE)